Add test sharding, proactive clean, and retry logic for self-hosted CI #1171

sbryngelson wants to merge 9 commits into MFlowCode:master
Conversation
- Shard Frontier GPU tests into 2 parts for faster parallel execution
- Add proactive `./mfc.sh clean` in Phoenix test scripts to prevent cross-compiler contamination from stale build artifacts
- Add `--requeue` to Phoenix SLURM jobs for preemption recovery
- Add lint-gate job that must pass before self-hosted tests run
- Add retry logic for GitHub runner tests (retry ≤5 failures)
- Add Frontier AMD test support with dedicated submit/test scripts
- Restructure self-hosted matrix with explicit cluster names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Adds test sharding across CI and cluster jobs, propagates shard arguments through GitHub Actions and SLURM submit/test scripts, updates SLURM resource/QOS settings for Frontier clusters, adds retry and failed-UUID handling, and introduces CLI/test-suite shard filtering and failure persistence.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as GitHub Actions
    participant Runner as CI Runner
    participant Submit as submit.sh
    participant SLURM as Scheduler
    participant TestSh as mfc.sh / Test Runner
    participant Tests as Test Suite
    GH->>Runner: start job (matrix includes shard)
    Runner->>Submit: run submit.sh with shard arg
    Submit->>SLURM: sbatch (env: JOB_SHARD)
    SLURM->>TestSh: allocate nodes, start job
    TestSh->>TestSh: compute shard_opts from JOB_SHARD
    TestSh->>Tests: run tests with --shard (shard_opts)
    Tests->>TestSh: failed UUIDs (write failed_uuids.txt)
    TestSh->>GH: exit status, artifacts/logs (archive)
```
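The shard hand-off at the center of the diagram is small enough to sketch. This is a hypothetical reconstruction, assuming the test script reads a `JOB_SHARD` environment variable set by `sbatch` and forwards it to the runner as `--shard`; the exact wiring in the real scripts may differ:

```shell
# Hypothetical sketch: translate the JOB_SHARD env var (e.g. "1/2")
# into the --shard option passed to the test runner.
JOB_SHARD="1/2"

shard_opts=""
if [ -n "$JOB_SHARD" ]; then
    shard_opts="--shard $JOB_SHARD"   # empty JOB_SHARD means "run everything"
fi

echo "./mfc.sh test $shard_opts"
```

An unset `JOB_SHARD` leaves `shard_opts` empty, so unsharded jobs keep their old behavior.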
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
CodeAnt AI finished reviewing your PR.
Pull request overview
This PR enhances the self-hosted CI infrastructure with test sharding, proactive cleanup, and retry mechanisms to improve reliability and reduce execution time. It addresses cross-compiler contamination issues on persistent runners and enables faster parallel test execution on batch partition systems.
Changes:
- Add retry logic for GitHub runner tests (≤5 failures trigger automatic retest)
- Shard Frontier GPU tests into 2 parallel jobs for faster execution
- Add proactive `./mfc.sh clean` to Phoenix test scripts
- Add `--requeue` flag to Phoenix SLURM jobs for preemption recovery
- Wrap Frontier build steps in retry action with automatic cleanup
- Update Frontier SLURM configuration (account, partition, timeout, QOS)
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| .github/workflows/test.yml | Add retry logic for ≤5 test failures, add shard parameter to matrix, wrap builds in retry action, remove deprecated environment variables |
| .github/workflows/phoenix/test.sh | Add proactive ./mfc.sh clean to prevent cross-compiler contamination |
| .github/workflows/phoenix/submit.sh | Add --requeue flag for automatic preemption recovery |
| .github/workflows/frontier/test.sh | Add shard parameter handling for test splitting |
| .github/workflows/frontier/submit.sh | Update SLURM config (account, partition, timeout, QOS) and add shard parameter |
| .github/workflows/frontier_amd/test.sh | Add shard parameter handling for test splitting |
| .github/workflows/frontier_amd/submit.sh | Update SLURM config (account, partition, timeout, QOS) and add shard parameter |
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/test.yml (1)
265-274: ⚠️ Potential issue | 🟡 Minor: Log file references don't account for shard; they will break if job_slug is fixed.

`test-${{ matrix.device }}-${{ matrix.interface }}.out` on line 267 assumes the output filename doesn't include a shard suffix. This is currently consistent with the submit scripts, but if the job_slug collision (flagged on `frontier_amd/submit.sh`) is fixed by incorporating the shard, these references must be updated in tandem.

Also, the artifact `name` on line 273 doesn't include the shard, which could cause upload conflicts for sharded matrix entries with the same device/interface (e.g., two `gpu-acc` frontier shards). `strategy.job-index` makes it unique, but adding the shard would improve clarity.

Proposed fix (apply after fixing job_slug in submit scripts)

```diff
 - name: Print Logs
   if: always()
-  run: cat test-${{ matrix.device }}-${{ matrix.interface }}.out
+  run: cat test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}.out
 - name: Archive Logs
   uses: actions/upload-artifact@v4
   if: matrix.cluster != 'phoenix'
   with:
-    name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}
+    name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}
-    path: test-${{ matrix.device }}-${{ matrix.interface }}.out
+    path: test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}.out
```

Note: The shard value contains `/` (e.g., `1/2`), which is invalid in filenames. The submit script slug sanitization would need to handle this (e.g., replace `/` with `-of-`), and the workflow expressions here would need to match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test.yml around lines 265-274: update the Print Logs and Archive Logs steps so the logfile and artifact name include the shard-aware slug used by the submit scripts (instead of assuming test-${{ matrix.device }}-${{ matrix.interface }}.out). Locate the "Print Logs" and "Archive Logs" steps and change the referenced filename and artifact name to incorporate the sanitized job slug/shard token produced by the submit scripts (the same slug that replaces "/" with a safe separator such as "-of-"); ensure the workflow expression that builds the filename and the artifact "name" use that sanitized slug so filenames and artifact names remain unique and valid across sharded jobs.

.github/workflows/frontier_amd/submit.sh (1)
31-32: ⚠️ Potential issue | 🔴 Critical: Job slug does not include shard; SLURM output files collide when sharded tests run concurrently.

When multiple shards for the same `device`/`interface` pair run on the same HPC cluster, they produce identical `job_slug` values (e.g., `test-gpu-acc` for both shard `1/2` and `2/2`), resulting in identical `output_file` names. Since both SLURM jobs execute from the same `SLURM_SUBMIT_DIR`, one job's output will silently overwrite the other's. This affects both `.github/workflows/frontier/submit.sh` and `.github/workflows/frontier_amd/submit.sh` at line 31. Incorporate the shard into the slug:

Proposed fix

```diff
-job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3"
+shard_suffix=""
+if [ -n "$4" ]; then
+  shard_suffix="-$(echo "$4" | sed 's|/|-of-|')"
+fi
+job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3${shard_suffix}"
```

Additionally, update `.github/workflows/test.yml` lines 267 and 273 to account for the shard suffix:

- Line 267: `cat test-${{ matrix.device }}-${{ matrix.interface }}.out` → `cat test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' | replace('/', '-of-') }}.out`
- Line 273 artifact name: include shard suffix to match

The usage messages in both scripts (line 9) should also be updated to document the `interface` and `shard` parameters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/frontier_amd/submit.sh around lines 31 - 32, The job_slug currently built by job_slug and used for output_file omits the shard, causing name collisions; update the job_slug generation (the job_slug variable and any references to output_file) to append the shard identifier (formatting the shard like "-{shard}" and replacing "/" with "-of-" for values like "1/2") so each shard produces a unique slug; also update the script usage message (the usage text near the top that lists parameters) to document the interface and shard parameters, and update the workflow steps that read and upload artifacts (the cat command that reads test-${matrix.device}-${matrix.interface}.out and the artifact name) to include the same shard suffix formatting so artifact names and printed output match the new job_slug convention.
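As a sanity check on the proposed sanitization, the `sed` substitution can be exercised on its own; the `test-gpu-acc` slug prefix below is illustrative, not taken from the scripts:

```shell
# Demonstrate the proposed shard sanitization: "1/2" becomes "-1-of-2",
# which is safe to embed in a filename.
shard="1/2"
shard_suffix=""
if [ -n "$shard" ]; then
    shard_suffix="-$(echo "$shard" | sed 's|/|-of-|')"
fi
job_slug="test-gpu-acc${shard_suffix}"
echo "$job_slug"   # test-gpu-acc-1-of-2
```

With an empty shard the suffix stays empty, so unsharded jobs keep their existing slug.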
🧹 Nitpick comments (1)
.github/workflows/frontier_amd/submit.sh (1)
8-9: Usage message is outdated; it does not document the interface or shard arguments.

The script accepts up to 4 positional arguments (`$1`=script, `$2`=device, `$3`=interface, `$4`=shard), but the usage string only mentions the first two.

Proposed fix

```diff
 usage() {
-    echo "Usage: $0 [script.sh] [cpu|gpu]"
+    echo "Usage: $0 [script.sh] [cpu|gpu] [none|acc|omp] [shard]"
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/frontier_amd/submit.sh around lines 8 - 9, The usage() function's message is outdated and only mentions two arguments; update the echo in usage() to document all supported positional params ($1 script, $2 device (cpu|gpu), $3 interface, $4 shard) and any defaults or optional markers (e.g., "[interface]" "[shard]") so callers see the full signature; edit the echo inside usage() to a single clear line listing script, device, interface, and shard and optional/default semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/frontier/submit.sh:
- Around line 31-32: job_slug and output_file are colliding for parallel shards
because they only use basename("$1") with $2 and $3; update the job_slug
generation (and derived output_file) to include an additional unique shard
identifier (for example a shard index/ID passed as another script argument or a
runtime value like the process/array task id) so each shard produces a distinct
job_slug and output_file; change the construction that sets job_slug and the
assignment of output_file to append that unique identifier.
1 issue found across 7 files
Confidence score: 4/5
- Moderate risk only: the cleanup step in `.github/workflows/phoenix/test.sh` doesn't check the `./mfc.sh clean` exit status, so failures could allow stale artifacts to affect builds/tests.
- This is a CI reliability concern rather than a direct product bug, so it's likely safe to merge with minimal risk.
- Pay close attention to `.github/workflows/phoenix/test.sh`: ensure cleanup failures don't silently proceed.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/phoenix/test.sh">
<violation number="1" location=".github/workflows/phoenix/test.sh:5">
P2: The `./mfc.sh clean` exit status is not checked. If the clean fails, the script continues and may build/test against stale or corrupted artifacts, defeating the purpose of this proactive cleanup and causing hard-to-diagnose failures.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
The CI test scripts use --shard for splitting Frontier GPU tests across multiple jobs, and failed_uuids.txt for retry logic. These toolchain changes were missing from the cherry-pick. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
toolchain/mfc/test/test.py (1)
102-108: Shard filtering is correct; minor readability nit on line 104.

The validation logic handles all edge cases correctly (short-circuit `or` ensures `int()` is never called on non-digit strings), and `i % shard_count == shard_idx - 1` correctly partitions cases without overlap. The placement after all other filters and before `--percent` is the right ordering.

Optional: the compound condition on line 104 can be split into guard clauses to improve readability:

♻️ Optional readability refactor

```diff
-        if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
-            raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').")
+        def _bad_shard():
+            if len(parts) != 2 or not all(p.isdigit() for p in parts):
+                return True
+            n, i = int(parts[1]), int(parts[0])
+            return n < 1 or not (1 <= i <= n)
+        if _bad_shard():
+            raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@toolchain/mfc/test/test.py` around lines 102 - 108, The compound validation in the ARG("shard") block is correct but hard to read; refactor the conditional inside the if ARG("shard") is not None: block by splitting the long compound condition into explicit guard checks: first split = ARG("shard").split("/") and verify length == 2, then check that both parts are digits (using parts[0].isdigit() and parts[1].isdigit()), then parse shard_idx = int(parts[0]) and shard_count = int(parts[1]) and validate shard_count >= 1 and 1 <= shard_idx <= shard_count; on any failure raise MFCException with the same message, then compute skipped_cases and cases using shard_idx and shard_count as before.
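A self-contained sketch of the validate-then-partition scheme the comment describes. `ValueError` stands in for `MFCException` so it runs outside the toolchain, and `isdecimal()` is used per the stricter guard suggested elsewhere in this review; the function name is illustrative:

```python
def shard_filter(cases, shard):
    # Validate "i/n" and keep every n-th case starting at offset i-1,
    # mirroring the i % shard_count == shard_idx - 1 partition.
    parts = shard.split("/")
    if len(parts) != 2 or not all(p.isdecimal() for p in parts) \
            or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
        raise ValueError(f"Invalid --shard '{shard}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').")
    shard_idx, shard_count = int(parts[0]), int(parts[1])
    return [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]

cases = list(range(10))
print(shard_filter(cases, "1/2"))  # [0, 2, 4, 6, 8]
print(shard_filter(cases, "2/2"))  # [1, 3, 5, 7, 9]
```

The modulo partition guarantees the shards are disjoint and their union is the full case list, which is what makes running `1/2` and `2/2` as parallel CI jobs safe.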
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@toolchain/mfc/test/test.py`:
- Around line 217-224: When abort_tests.is_set() causes test() to raise
MFCException the existing cleanup that writes/removes failed_uuids.txt (the
failed_uuids_path handling around failed_tests, open(...), os.remove(...)) is
skipped; modify the exception/exit path to always attempt to remove stale
failed_uuids_path and wrap file I/O (open and os.remove) in try/except catching
OSError (or Exception) so I/O errors are logged but do not replace the real exit
code—i.e., in the MFCException handler and/or finally block ensure you try to
delete failed_uuids_path if it exists and handle/log any OSError from
open()/os.remove() instead of letting it propagate.
---
Nitpick comments:
In `@toolchain/mfc/test/test.py`:
- Around line 102-108: The compound validation in the ARG("shard") block is
correct but hard to read; refactor the conditional inside the if ARG("shard") is
not None: block by splitting the long compound condition into explicit guard
checks: first split = ARG("shard").split("/") and verify length == 2, then check
that both parts are digits (using parts[0].isdigit() and parts[1].isdigit()),
then parse shard_idx = int(parts[0]) and shard_count = int(parts[1]) and
validate shard_count >= 1 and 1 <= shard_idx <= shard_count; on any failure
raise MFCException with the same message, then compute skipped_cases and cases
using shard_idx and shard_count as before.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@ Coverage Diff @@
##           master    #1171   +/-   ##
=======================================
  Coverage   44.05%   44.05%
=======================================
  Files          70       70
  Lines       20498    20498
  Branches     1990     1990
=======================================
  Hits         9030     9030
  Misses      10329    10329
  Partials     1139     1139
```

☔ View full report in Codecov by Sentry.
CodeAnt AI Incremental review completed.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/test.yml:
- Around line 146-166: The workflow is triggering shellcheck/actionlint SC2046
on unquoted command substitution for nproc; update both invocations of -j
$(nproc) in the test run commands (the two calls to "/bin/bash mfc.sh test -v
... -j $(nproc) ..." and the retry call) to quote the substitution as -j
"$(nproc)"; leave $TEST_ALL, $TEST_PCT and the intentionally unquoted $FAILED
as-is and ensure the surrounding logic using TEST_EXIT, failed_uuids.txt and the
retry branch remains unchanged.
- Clean up failed_uuids.txt on the early-abort path so CI doesn't retry stale UUIDs from a previous run
- Guard the retry condition with NUM_FAILED > 0 to prevent a full-suite rerun when the file exists but is empty
- Quote $(nproc) to silence shellcheck SC2046 warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The build system should handle compiler changes correctly. Proactive clean forces full rebuilds of FFTW/LAPACK from scratch every run, which is slow and exposes builds to transient filesystem failures (CMake TryCompile errors on Phoenix scratch). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (3)
toolchain/mfc/test/test.py (1)
104: Prefer `str.isdecimal()` over `str.isdigit()` for the digit guard.

`str.isdigit()` returns `True` for Unicode super/subscript characters like `"²"`, but `int("²")` raises a `ValueError`. This creates a gap: a value like `"²/2"` passes the guard but crashes on `int(parts[0])`. `str.isdecimal()` limits acceptance to `[0-9]`, closing the gap.

♻️ Proposed fix

```diff
-        if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
+        if len(parts) != 2 or not all(p.isdecimal() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
```
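The gap is easy to reproduce in a few lines:

```python
s = "²"  # U+00B2 SUPERSCRIPT TWO

print(s.isdigit())    # True  -> slips past an isdigit() guard
print(s.isdecimal())  # False -> rejected by isdecimal()

try:
    int(s)            # the guard "passed", but parsing still fails
except ValueError as e:
    print(e)
```

So with `isdigit()` the guard and `int()` disagree on what counts as a digit; `isdecimal()` accepts exactly the characters `int()` can parse in base 10.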
Verify each finding against the current code and only fix it if needed. In `@toolchain/mfc/test/test.py` at line 104: the guard currently uses str.isdigit() on elements of parts in the conditional (the expression containing "len(parts) != 2 or not all(p.isdigit() for p in parts) ..."), which allows non-ASCII numeric characters that int() can't parse; update the predicate to use str.isdecimal() instead (i.e., replace the all(p.isdigit() for p in parts) check with all(p.isdecimal() for p in parts)) so only ASCII digits [0-9] are accepted before calling int() on parts[0] and parts[1].

.github/workflows/test.yml (2)
277-282: Artifact name omits the shard; consider adding it for easier log identification.

Currently `logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}` is unique across matrix entries (thanks to `strategy.job-index`), but the shard isn't visible in the artifact name. When debugging a sharded run it's helpful to know at a glance whether the log came from shard `1/2` or `2/2`. A minor ergonomics improvement:

♻️ Proposed name including shard

```diff
-          name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}
+          name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-shard-{0}', matrix.shard) || '' }}
```

Note: GitHub artifact names cannot contain `/`, so you'll need to sanitize the shard value (e.g., replace `/` with `-`):

```diff
+          name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-shard-{0}', replace(matrix.shard, '/', '-')) || '' }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test.yml around lines 277 - 282, The artifact name for the "Archive Logs" step currently uses logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}, which omits the shard; update the Upload Artifact step (name: Archive Logs, uses: actions/upload-artifact@v4) to append the sanitized shard value from matrix.shard (use the replace expression to substitute '/' with '-' e.g. replace(matrix.shard, '/', '-')) so the artifact name becomes logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}-<sanitized shard>, ensuring shard information is visible and safe for GitHub artifact naming.
146-166: Useless use of `cat` on line 155.

`FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')` forks an extra process unnecessarily.

♻️ Proposed fix

```diff
-          FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')
+          FAILED=$(tr '\n' ' ' < tests/failed_uuids.txt)
```
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test.yml around lines 146 - 166, The assignment to FAILED currently uses a useless cat pipeline ("FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')") which spawns an extra process; change it to read the file directly into tr or use shell builtins to avoid forking (for example use input redirection with tr like "FAILED=$(tr '\n' ' ' < tests/failed_uuids.txt)" or use readarray/mapfile to populate and join), leaving the surrounding logic (NUM_FAILED, retry invocation of mfc.sh, and TEST_EXIT handling) untouched and still referencing tests/failed_uuids.txt and the FAILED variable.
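Both forms produce the same value; the redirection merely avoids forking `cat` (the UUID contents and the /tmp path below are made up for the demonstration):

```shell
# Build a throwaway failed-UUIDs file and flatten it to one line.
printf 'uuid-a\nuuid-b\nuuid-c\n' > /tmp/failed_uuids.txt

WITH_CAT=$(cat /tmp/failed_uuids.txt | tr '\n' ' ')   # forks cat + tr
NO_CAT=$(tr '\n' ' ' < /tmp/failed_uuids.txt)         # forks only tr

echo "$NO_CAT"                         # uuid-a uuid-b uuid-c
[ "$WITH_CAT" = "$NO_CAT" ] && echo "identical"
```

Command substitution strips trailing newlines either way, so the retry step sees an identical space-separated UUID list.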
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@toolchain/mfc/test/test.py`:
- Around line 193-196: The unguarded os.remove(failed_uuids_path) can raise
OSError and mask the original MFCException on the abort path; wrap the call to
os.remove (referring to failed_uuids_path) in a try/except OSError that either
logs the exception or quietly ignores it (consistent with the I/O block pattern
used elsewhere) so the original MFCException from the abort flow is preserved.
---
Duplicate comments:
In @.github/workflows/test.yml:
- Line 31: The quoting issue for the CPU count in the workflow step has already
been fixed; no code change is required—leave the step calling the script as
"./mfc.sh format -j \"$(nproc)\"" in the GitHub Actions workflow to preserve
correct quoting and shell expansion.
In `@toolchain/mfc/test/test.py`:
- Around line 222-229: The code writing/removing failed_uuids_path (using
failed_uuids_path, failed_tests, open(), os.remove()) is unguarded and can raise
I/O errors that would replace the intended exit(nFAIL); wrap the write and the
remove branches in try/except blocks that catch Exception, log or print the
error with context (including the path and exception), and ensure the script
still calls exit(nFAIL) after handling the I/O error so the test failure status
is preserved.
CodeAnt AI Incremental review completed.
Bot reviews (AI code reviewers) were triggering the benchmark workflow, and the concurrency group was cancelling the real benchmark run from the pull_request event. Gate the workflow early by skipping when the review author is a Bot account type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ty file check

- Wrap os.remove() in try/except OSError on the abort path so permission errors don't mask the real MFCException
- Only pass the --precision flag when matrix.precision is non-empty to avoid an invalid bare -- argument
- Use -s instead of -f for failed_uuids.txt to skip the retry when the file exists but is empty

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
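The `-f` vs `-s` distinction behind that last change, demonstrated in isolation (the /tmp file paths are throwaway examples):

```shell
# -f: file exists; -s: file exists AND has a size greater than zero.
touch /tmp/empty_uuids.txt
printf 'uuid-a\n' > /tmp/some_uuids.txt

[ -f /tmp/empty_uuids.txt ] && echo "-f matches the empty file"
[ -s /tmp/empty_uuids.txt ] || echo "-s skips the empty file (no retry)"
[ -s /tmp/some_uuids.txt ]  && echo "-s matches the non-empty file (retry)"
```

Guarding the retry with `-s` means a zero-byte failed_uuids.txt, left over from a clean run, no longer triggers a pointless retest.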
User description
Summary
- Add proactive `./mfc.sh clean` in Phoenix test scripts to prevent cross-compiler contamination from stale build artifacts
- Add `--requeue` to Phoenix SLURM jobs for preemption recovery

Depends on: #1170 (for `monitor_slurm_job.sh` and build script changes)

Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit: New Features, Tests, Chores
CodeAnt-AI Description
Shard GPU tests, add targeted retry and CI preemption/retry handling
What Changed
Impact
✅ Faster GPU test completion via parallel shards
✅ Fewer flaky CI failures rerun (targeted retries for 1–5 failures)
✅ Fewer preemption-related test interruptions on Phoenix (auto-requeue)

💡 Usage Guide
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.